Abstract. In computational phylogenetics, the problem of constructing a supertree of
a given set of rooted input trees can be formalized in different ways, to cope with 
contradictory information in the input. We consider the Minimum Flip Supertree problem, 
where the input trees are transformed into a 0/1/?-matrix, such that each row represents
a taxon, and each column represents an inner node of one of the input trees. Our goal is 
to find a perfect phylogeny for the input matrix requiring a minimum number of 0/1-flips, 
that is, corrections of 0/1-entries in the matrix. The problem is known to be NP-complete. 
Here, we present a parameterized data reduction with polynomial running time. The data 
reduction guarantees that the reduced instance has a solution if and only if the original 
instance has a solution. We then make our data reduction parameter-independent by using 
upper bounds. This allows us to preprocess an instance, and to solve the reduced instance 
with an arbitrary method. Unlike an existing data reduction for the consensus tree
problem, our reduction allows us to draw conclusions about certain entries in the matrix.
We have implemented and evaluated our data reduction. Unfortunately, we find that the
Minimum Flip Supertree problem is also hard in practice: The amount of information that
can be derived during data reduction diminishes as instances get more "complicated", and
running times for "complicated" instances quickly become prohibitive. Still, our method
offers another route of attack for this relevant phylogenetic problem. 



1 Introduction 

When studying the relationship and ancestry of current organisms, discovered relations
are usually represented as phylogenetic trees, that is, rooted trees where each leaf
corresponds to a group of organisms, called a taxon, and each inner vertex represents
the hypothetical last common ancestor of the organisms located at the leaves of its
subtree.

Supertree methods assemble phylogenetic trees with non-identical but overlapping
taxon sets into a larger supertree that contains all taxa of every input tree and describes
the evolutionary relationship of these taxa [3]. Constructing a supertree is easy if no 
contradictory information is encoded in the input trees [HE]. The major problem of 
supertree methods is dealing with incompatible data in a reasonable way, where it should 
be understood that incompatible input trees are the rule rather than the exception in 
phylogenetic supertree analysis. 

Matrix representation (MR) supertree methods encode inner vertices of all input trees 
as partial binary characters in a matrix, which is then analyzed using an optimization
or agreement criterion to yield the supertree. In 1992, Baum [2] and Ragan [17]
independently proposed the matrix representation with parsimony (MRP) method as
the first matrix representation method; it performs a maximum parsimony analysis
on a matrix representation of the input trees. MRP is by far the most widely used
supertree method today, and the constructed supertrees are of comparatively high quality.
The Maximum Parsimony problem is NP-complete [11], and so is the MRP problem.

The matrix representation with flipping (MRF) supertree method also uses a matrix 
representation of the rooted input trees, with matrix entries '0', '1', and '?' [SJ. Utilizing
the parsimony principle, MRF seeks the minimum number of "flips" 0 → 1 or 1 → 0
in the input matrix that make the resulting matrix consistent with a phylogenetic tree,
where '?'-entries can be resolved arbitrarily. Evaluations by Chen et al. [8] indicate that
MRF is on par with the "gold standard" MRP, and superior to other approaches for 
supertree construction. Most supertree methods take rooted trees as input, and so does 
MRF; but this is not a problem in practice, as in practically all relevant cases, input 
trees can be rooted by an outgroup. 

If all input trees share the same set of taxa, the supertree is called a consensus tree [1].
As for supertrees, we can encode the input trees in a matrix, here with matrix entries
'0' and '1'. In case there exist no conflicts between the input trees, we can construct
the corresponding perfect phylogeny in O(mn) time for n taxa and m characters [13].
To deal with incompatible input trees, the MRF consensus tree problem again seeks the
minimum number of flips in the input matrix to reach a perfect phylogeny. This problem
is NP-hard [S], but there have been some recent algorithmic results: The problem can be
approximated with approximation ratio 2d where d is the maximum number of ones in a
column [9]. This approximation ratio is obviously prohibitive in practice, but no constant
factor approximation is known. On the parameterized side, let k denote the number of
flips required to correct the input matrix: Komusiewicz et al. [14] give a problem kernel
with O(k³) vertices for the MRF consensus tree problem, and Böcker et al. [5] present an
O(4.83^k + poly(m, n)) search tree algorithm.

For the more general MRF supertree problem, there has been less progress: Clearly, 
the MRF supertree decision problem is NP-complete, as it generalizes the MRF consensus 
tree problem, and we can check in polynomial time if a given binary matrix M* is a 
perfect phylogeny and has distance at most k to our input matrix. We can test whether an 
MRF supertree instance admits a perfect phylogeny without flipping in time O(mn) [16].
There exist no approximation algorithms or parameterized algorithms in the literature. 
Chen et al. [8] present a heuristic for MRF supertrees based on branch swapping, and 
Chimani et al. [10] introduce an Integer Linear Program (ILP) to find exact solutions. 
Recently, Böcker et al. [6] presented a heuristic top-down algorithm based on the MRF
intuition, namely the FlipCut supertree method, which is both swift and accurate in 
practice. 

Our contributions. Here, we present a set of reduction rules that can be applied to an 
arbitrary instance of the MRF supertree problem, requiring polynomial running time. 
Our data reduction is parameterized, in the sense that we assume a maximal number 
of flips k to be given. The data reduction guarantees that the reduced instance has a
solution if and only if the original instance has a solution. We then show how to make
the reduction parameter-independent, by using upper and lower bounds. This allows us
to preprocess an instance, and to solve the reduced instance with any method, be it
an ILP, a search tree algorithm, or a heuristic. Unlike [14], our data reduction
allows us to draw conclusions about certain entries in the input matrix, whereas the
data reduction for MRF consensus trees in [14] only removes certain characters and taxa
from the input.

We have implemented and evaluated our data reduction on a set of MRF supertree 
instances from [8]. Unfortunately, we find that running times become prohibitive when
instances become large, or contain many '?'. This agrees with findings in [10], where
"complicated" instances could not be processed by the ILP in reasonable running time.
Still, we believe that the data reduction presented here can be an important step
towards both exact methods and improved heuristics for the MRF supertree problem.

2 Preliminaries 

Let n be the number of taxa and m be the number of characters or features. For brevity, 
we assume that our set of characters equals {1, . . . , m}, and that our set of taxa equals 
{1, . . . , n}. Each taxon t can possess or not possess each character v, encoded in a binary 
n × m matrix M, where columns of M correspond to characters and rows correspond
to taxa. For the moment, we do not allow '?' to appear in the input matrix. Under 
the classical perfect phylogeny model [18], we assume that there exists an ancestral 
species that possesses none of the characters, corresponding to a row of zeros. We further 
assume that each transition from '0' to '1' happens at most once in the tree: An invented 
character never disappears and is never invented twice. We say that M admits a perfect 
phylogeny if there is a rooted tree with n leaves corresponding to the n taxa, where for
each character u, there is an inner node w of the tree such that M[t, u] = 1 holds if and
only if taxon t is a leaf of the subtree below w, for all t.

Given an arbitrary binary matrix M, we may ask whether M admits a perfect 
phylogeny. Gusfield [13] shows how to test M and, if possible, construct the corresponding
phylogenetic tree in time O(mn). There exist several characterizations for such
matrices [16], of which we only mention two here. Let I_M(v) := {t : M[t, v] = 1}
be the set of '1'-indices in column v. Matrices that admit a perfect phylogeny can be
characterized via the pairwise compatibility of all column pairs u, v: That is, I_M(u) ⊆
I_M(v) or I_M(v) ⊆ I_M(u) or I_M(u) ∩ I_M(v) = ∅ must hold. Characters that do not satisfy
this condition are said to be in conflict. We can also characterize such matrices via local
conflicts: Let Q(M) be the bipartite graph on character vertices v ∈ {1, . . . , m} and taxa
vertices t ∈ {1, . . . , n}, such that an edge (v, t) exists if and only if M[t, v] = 1. Now,
M admits a perfect phylogeny if and only if the graph Q(M) is M-free, that is, it does
not contain an induced path of length four starting from and ending in different taxa
vertices [9].
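
The pairwise compatibility characterization can be sketched in a few lines. This is only the naive O(m²n) check, not Gusfield's O(mn) algorithm; the function names and the list-of-rows matrix representation are our assumptions for illustration.

```python
from itertools import combinations

def one_set(matrix, v):
    """Return I_M(v): the set of taxa (row indices) with a 1 in column v."""
    return {t for t, row in enumerate(matrix) if row[v] == 1}

def admits_perfect_phylogeny(matrix):
    """Naive pairwise compatibility test for a binary matrix (no '?' entries)."""
    m = len(matrix[0]) if matrix else 0
    sets = [one_set(matrix, v) for v in range(m)]
    for u, v in combinations(range(m), 2):
        a, b = sets[u], sets[v]
        # Compatible pairs are nested or disjoint; anything else is a conflict.
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True
```

For example, the columns {t1, t2} and {t2, t3} overlap without being nested, so a matrix containing both is rejected.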

We consider two variants of Matrix Representation with Flipping problems, namely 
the Minimum Flip Consensus Tree (MFCT) and the Minimum Flip Supertree 
(MFST) problem. For the MFCT problem, consider a set of binary rooted trees on the
same set of n taxa. We encode the input trees in a binary matrix M, where each column
corresponds to an inner node in one of the trees, and an entry '1' indicates that the 
corresponding taxon is a leaf of the subtree rooted in the inner node. We ask for the 
minimum number of modifications ("flips") to M such that the resulting matrix admits 
a perfect phylogeny. We refer to this number of flips as the cost of the instance. 

The more general MFST problem arises when the input trees have overlapping but 
not necessarily identical taxa sets. In this case, for characters belonging to a particular 
input tree, the state ('0' or '1') of some taxa is not known as they are not part of the 
input tree, and represented by a question mark ('?'). We ask for a perfect phylogeny 
matrix M* such that the number of entries where one matrix contains a '0' and the 
other matrix a '1', is minimal. This is the number of flips required to correct the input 
matrix M, where '?'-entries can be resolved arbitrarily. Note that a perfect phylogeny
matrix must not contain '?' entries. Both for MFST and MFCT, we usually have n ≪ m.
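
As a minimal sketch of this cost measure, assume matrices are lists of rows with entries 0, 1, or the string '?' (our encoding, not prescribed by the problem definition). The hypothetical helper below counts only positions where the input has a '0' or '1' that differs from the candidate solution.

```python
def flip_cost(m_in, m_star):
    """Number of 0/1 disagreements between input M and candidate M*.

    '?' entries of the input can be resolved arbitrarily and cost nothing.
    """
    cost = 0
    for row_in, row_star in zip(m_in, m_star):
        for a, b in zip(row_in, row_star):
            if b not in (0, 1):
                raise ValueError("M* must not contain '?' entries")
            if a != "?" and a != b:
                cost += 1
    return cost
```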

Throughout this paper, we assume that the input matrix M does not contain any all- 
zero columns: If the matrix contained such columns, we could simply remove them.
We infer that any optimal solution does not contain an all-zero column: Otherwise, we
could leave one of the entries in the flipped matrix M* in its original state '1', thereby
constructing a matrix that is also a perfect phylogeny but requires fewer flips. This
follows because a character that is exhibited by a single taxon, cannot be in conflict with 
any other character. We make use of the fact throughout this paper without explicitly 
referring to it. 

3 The inclusion graph 

Given an instance M ∈ {0, 1}^(n×m) of the Minimum Flip Consensus Tree problem, we
say that two characters u and v are in conflict if I_M(u) ∩ I_M(v) ≠ ∅ but I_M(u) ⊈ I_M(v)
and I_M(v) ⊈ I_M(u). We define the inclusion graph G = (V, E) as follows: This graph
has vertex set V := {1, . . . , m}, being the characters of matrix M. Two vertices u, v ∈ V
can be connected via a directed edge (u, v), or by an undirected edge uv = {u, v}. An
inclusion edge (u, v) from u ∈ V to v ∈ V is present if I_M(u) ⊆ I_M(v). A disjoint edge uv
connecting u, v ∈ V is present if I_M(u) ∩ I_M(v) = ∅. Any two vertices u, v are connected
by either no edge in case u, v are in conflict; by a single edge (u, v), (v, u), or uv; or by
two inclusion edges (u, v) and (v, u) at the same time.
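
The edge definitions can be illustrated by computing, for one pair of columns, which inclusion-graph edges are present. This is a sketch under the assumption that the matrix is binary and has no all-zero columns; the tuple encoding of edges is our own choice.

```python
def column_relation(matrix, u, v):
    """Return the inclusion-graph edges present between characters u and v.

    Edges are encoded as ("incl", a, b) for a directed edge (a, b) and
    ("disj", a, b) with a < b for an undirected disjoint edge.
    An empty result means that u and v are in conflict.
    """
    i_u = {t for t, row in enumerate(matrix) if row[u] == 1}
    i_v = {t for t, row in enumerate(matrix) if row[v] == 1}
    edges = set()
    if i_u <= i_v:
        edges.add(("incl", u, v))
    if i_v <= i_u:
        edges.add(("incl", v, u))
    if i_u.isdisjoint(i_v):
        edges.add(("disj", min(u, v), max(u, v)))
    return edges
```

Because edges are computed from M on demand, there is no need to store the whole graph in this sketch.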

If two vertices u, v ∈ V are connected by both directed edges (u, v) and (v, u), then
u and v have the same neighborhood I_M(u) = I_M(v). In this case, there exists an
optimal solution such that u and v also have the same neighborhood [2]. There may
also exist optimal solutions such that u and v have different neighborhoods, but this
case is somewhat pathological and will rarely appear in practice. In view of this, we can
immediately merge u and v. In order to merge nodes in the inclusion graph, we assume
that each character vertex v ∈ V and each column of the matrix M has a weight assigned
to it, representing its multiplicity. For readability, we omit these simple details in the
following. Now, we may assume that any two vertices u, v ∈ V are connected by at most
one edge.

We know that M admits a perfect phylogeny if and only if there exist no two vertices
u, v ∈ V that are in conflict. This is the case if and only if any two vertices in the
inclusion graph are connected by (at least) one edge. When resolving conflicts in the
matrix M, this will lead to induced changes in the inclusion graph. This allows us to
reformulate our problem: We search for the minimum number of changes in M, such that
any two vertices in the inclusion graph are connected by (at least) one edge.

If M admits a perfect phylogeny, then the resulting graph is transitive: from (u, v) ∈
E and (v, w) ∈ E we infer (u, w) ∈ E. We can also derive a similar deduction rule for
disjoint edges: from (u, v) ∈ E and vw ∈ E we infer uw ∈ E. We say that an inclusion
graph is tree-ish if it satisfies these deduction rules for all vertices.

In applications, there is no need to actually compute the inclusion graph of a matrix M,
as we can compute on the fly whether an edge is present or not, using M. Still, the
inclusion graph is useful in applications: During data reduction, we sometimes learn
that, say, I_M(u) ⊆ I_M(v) must hold for the optimal solution. In this case, we set the
respective edge to "permanent". More often, we will learn from the data that, say,
I_M(u) ⊆ I_M(v) cannot hold for the optimal solution. In this case, we set the respective
edge to "forbidden". Note that several forbidden edges may coexist for one vertex
pair u, v. But in case two out of the three edges (u, v), (v, u), and uv are set to forbidden,
we can immediately set the remaining edge to permanent.
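
The last observation amounts to a simple elimination step. A minimal sketch, assuming edges are encoded as ("incl", a, b) and ("disj", a, b) with a < b (a hypothetical encoding used only for this illustration):

```python
def remaining_edge(u, v, forbidden):
    """Return the edge that must become permanent, or None if still undecided.

    `forbidden` is a set of edges already excluded for the pair u, v.
    If all three edges are forbidden, the instance has no solution
    within the current cost bound.
    """
    candidates = [("incl", u, v), ("incl", v, u),
                  ("disj", min(u, v), max(u, v))]
    allowed = [e for e in candidates if e not in forbidden]
    if not allowed:
        raise ValueError("all three edges are forbidden")
    return allowed[0] if len(allowed) == 1 else None
```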

The inclusion graph, in turn, allows us to draw conclusions about entries in M: If 
there is a permanent edge (u, v) in the inclusion graph, and we decide to change or keep
an entry M[t, u] = 1 in our input matrix, this forces us to also set M[t, v] = 1. Similarly,
if we decide to change or keep an entry M[t, v] = 0, then the edge (u, v) in the inclusion
graph also forces us to set M[t, u] = 0. We will formalize these observations in the next
section. 

Note that we can define a similar inclusion graph for an instance M ∈ {0, 1, ?}^(n×m)
of the Minimum Flip Supertree problem. Here, M admitting a perfect phylogeny M*
does not imply that every two vertices in the inclusion graph are connected by an edge:
For example, an input matrix containing solely '?' results in an inclusion graph without
edges. But our other reasoning introduced above remains valid.

4 Parameterized data reduction 

We now describe data reduction rules for the Minimum Flip Supertree problem. Here, 
entries in the matrix M can be '?', and we have to ensure that such entries are chosen
"conservatively": To this end, we define I*_M(v) := {t : M[t, v] ∈ {1, ?}}.

We take a parameterized view of the problem: We assume that we are given an integer 
k, and we want to know if there exists a solution for input matrix M with cost at most k. 
This will allow us to set certain edges of the inclusion graph to forbidden or permanent, 
and also to permanently set certain entries in the matrix M, which may include resolving 
'?'-entries or even flipping entries in the matrix. We will see in Sec. 5 how these rules
can be applied during preprocessing. 

For u, v ∈ V we set N(u − v) := I_M(u) \ I*_M(v) and N(u + v) := I_M(u) ∩ I_M(v).
Recall that (u, v) being present in the inclusion graph of an optimal solution M* ∈
{0, 1}^(n×m) implies that I_{M*}(u) ⊆ I_{M*}(v) must hold. Similarly, uv being present implies
that I_{M*}(u) ∩ I_{M*}(v) = ∅. As we assume that the distance between M* and M is at
most k flips, we can easily deduce two simple reduction rules:

Rule 1. If |N(u − v)| > k then set (u, v) to forbidden.
Rule 2. If |N(u + v)| > k then set uv to forbidden.
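
Rules 1 and 2 can be sketched directly from the definitions of N(u − v) and N(u + v). The following is an illustrative, unoptimized version that recomputes the sets for one pair; the matrix encoding (0, 1, or the string '?') and the edge tuples are our assumptions.

```python
def ones(matrix, v):
    """I_M(v): taxa with a '1' in column v."""
    return {t for t, row in enumerate(matrix) if row[v] == 1}

def ones_or_unknown(matrix, v):
    """I*_M(v): taxa with a '1' or '?' in column v."""
    return {t for t, row in enumerate(matrix) if row[v] in (1, "?")}

def apply_rules_1_2(matrix, u, v, k):
    """Return the set of edges between u and v that Rules 1 and 2 forbid."""
    forbidden = set()
    # Rule 1 is two-sided, so both directions are tested.
    if len(ones(matrix, u) - ones_or_unknown(matrix, v)) > k:
        forbidden.add(("incl", u, v))
    if len(ones(matrix, v) - ones_or_unknown(matrix, u)) > k:
        forbidden.add(("incl", v, u))
    # Rule 2: too many shared '1' entries rule out the disjoint edge.
    if len(ones(matrix, u) & ones(matrix, v)) > k:
        forbidden.add(("disj", min(u, v), max(u, v)))
    return forbidden
```

Note how a '?' in column v shrinks N(u − v): the unknown entry could still be resolved to '1' at no cost, so it must not count against the inclusion edge.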

Note that the first rule is two-sided, as edges (u, v) are directed. In case two of the
three possible edges (u, v), uv, (v, u) between vertices u, v have been set to forbidden,
we set the remaining edge to permanent. If an edge is set to permanent and forbidden
simultaneously or, equivalently, if all three edges (u, v), uv, (v, u) are set to forbidden
simultaneously, then the instance has no solution with cost at most k. In case entries
in M have been permanently set, we can extend these rules as follows: We assume
|N(u − v)| = ∞ if both M[t, u] = 1 and M[t, v] = 0 are permanent for some taxon t; and
|N(u + v)| = ∞ if both M[t, u] = 1 and M[t, v] = 1 are permanent for some taxon t.

On the other hand, we can use permanent edges in G to derive information about 
entries in M: Keeping or setting some entry M[t, u], will require us to also change other 
entries in M. The next three rules follow immediately: 

Rule 3. If M[t, u] = 1 is permanent and (u, v) is permanent in G, then permanently
set M[t, v] = 1.

Rule 4. If M[t, v] = 0 is permanent and (u, v) is permanent in G, then permanently
set M[t, u] = 0.

Rule 5. If M[t, u] = 1 is permanent and uv is permanent in G, then permanently set
M[t, v] = 0.

Again, if an entry M[t,u] is permanently set to '0' and '1' simultaneously, then the 
instance has no solution with cost at most k. 
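
A minimal sketch of how Rules 3 to 5 might be propagated to a fixpoint, using a stack of pending updates. The data layout (a dict of permanent entries, a set of directed pairs, a set of two-element frozensets) is an assumption made for this illustration only.

```python
def propagate_entries(permanent, perm_incl, perm_disj):
    """Apply Rules 3-5 until nothing changes.

    permanent: dict mapping (t, u) to a permanent value 0 or 1
    perm_incl: set of permanent inclusion edges (u, v)
    perm_disj: set of permanent disjoint edges frozenset({u, v})
    Returns the extended dict, or None if some entry would have to be
    0 and 1 at the same time (no solution with cost at most k).
    """
    fixed = dict(permanent)
    stack = list(fixed.items())

    def set_entry(t, c, val):
        if (t, c) in fixed:
            return fixed[(t, c)] == val   # False signals a contradiction
        fixed[(t, c)] = val
        stack.append(((t, c), val))
        return True

    while stack:
        (t, c), val = stack.pop()
        for (u, v) in perm_incl:
            if val == 1 and u == c and not set_entry(t, v, 1):   # Rule 3
                return None
            if val == 0 and v == c and not set_entry(t, u, 0):   # Rule 4
                return None
        if val == 1:
            for edge in perm_disj:
                if c in edge:
                    (other,) = edge - {c}
                    if not set_entry(t, other, 0):               # Rule 5
                        return None
    return fixed
```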

Based on these observations, we can test in advance if the instance still allows us to
permanently set an entry of the matrix to '0' or '1'. The induced cost one for entry
M[t, u], denoted ico(t, u), is the number of vertices v ∈ V such that (u, v) is permanent
and M[t, v] = 0, plus the number of vertices w ∈ V such that uw is permanent and
M[t, w] = 1. Similarly, we define the induced cost zero for entry M[t, v], denoted icz(t, v),
as the number of vertices u ∈ V such that (u, v) is permanent and M[t, u] = 1. We also
take into account whether the entry M[t, v] is currently set to '0', '1', or '?'. To this end,
we define ico*(t, u) := ico(t, u) + 1 if M[t, u] = 0, and ico*(t, u) := ico(t, u) otherwise.
Similarly, we define icz*(t, v) := icz(t, v) + 1 if M[t, v] = 1, and icz*(t, v) := icz(t, v)
otherwise.

Rule 6. If ico*(t, u) > k then permanently set M[t, u] = 0.
Rule 7. If icz*(t, v) > k then permanently set M[t, v] = 1.
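
The induced costs can be computed directly from their definitions. The sketch below recomputes them from scratch for a single entry (the data reduction itself maintains them incrementally); the edge encoding is again our own assumption.

```python
def ico_star(matrix, t, u, perm_incl, perm_disj):
    """Induced cost of setting M[t, u] to '1', including the flip itself."""
    # Every permanent edge (u, v) with M[t, v] = 0 forces a flip of M[t, v].
    cost = sum(1 for (a, b) in perm_incl if a == u and matrix[t][b] == 0)
    # Every permanent disjoint edge uw with M[t, w] = 1 forces a flip of M[t, w].
    for edge in perm_disj:
        if u in edge:
            (w,) = edge - {u}
            if matrix[t][w] == 1:
                cost += 1
    return cost + (1 if matrix[t][u] == 0 else 0)

def icz_star(matrix, t, v, perm_incl):
    """Induced cost of setting M[t, v] to '0', including the flip itself."""
    cost = sum(1 for (a, b) in perm_incl if b == v and matrix[t][a] == 1)
    return cost + (1 if matrix[t][v] == 1 else 0)
```

With k = 2, a value ico_star(...) = 3 would trigger Rule 6 and permanently set the entry to '0'.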

We can do the inverse reasoning of Rules 3–5 and reach:

Rule 8. If M[t, u] = 1 is permanent and M[t, v] = 0 is permanent then set (u, v) to
forbidden.
Rule 9. If M[t, u] = 1 is permanent and M[t, v] = 1 is permanent then set uv to
forbidden.

Finally, we can use the fact that the inclusion graph must be tree-ish: 

Rule 10. If (u, v) is permanent and (v, w) is permanent then set (u, w) to permanent.
Rule 11. If (u,v) is permanent but (u,w) is forbidden then set (v,w) to forbidden. 
Rule 12. If (v, w) is permanent but (u,w) is forbidden then set (u,v) to forbidden. 
Rule 13. If (u, v) is permanent and vw is permanent then set uw to permanent. 
Rule 14. If (u, v) is permanent but uw is forbidden then set vw to forbidden. 
Rule 15. If vw is permanent but uw is forbidden then set (u, v) to forbidden. 
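
Rules 10 to 15 can be read as a closure computation over edge states. The following sketch applies them naively over all ordered triples until a fixpoint is reached; it is meant to illustrate the rules, not the incremental O(m)-per-update scheme analyzed in the proof of Theorem 1, and its data layout is our assumption.

```python
from itertools import permutations

def close_treeish(m, perm_incl, perm_disj, forb_incl, forb_disj):
    """Naive fixpoint of Rules 10-15 on m characters.

    Inclusion edges are pairs (u, v); disjoint edges are frozenset({u, v}).
    """
    perm_incl, perm_disj = set(perm_incl), set(perm_disj)
    forb_incl, forb_disj = set(forb_incl), set(forb_disj)
    changed = True
    while changed:
        changed = False
        for u, v, w in permutations(range(m), 3):
            uw, vw = frozenset((u, w)), frozenset((v, w))
            derived = []
            if (u, v) in perm_incl and (v, w) in perm_incl:
                derived.append((perm_incl, (u, w)))        # Rule 10
            if (u, v) in perm_incl and (u, w) in forb_incl:
                derived.append((forb_incl, (v, w)))        # Rule 11
            if (v, w) in perm_incl and (u, w) in forb_incl:
                derived.append((forb_incl, (u, v)))        # Rule 12
            if (u, v) in perm_incl and vw in perm_disj:
                derived.append((perm_disj, uw))            # Rule 13
            if (u, v) in perm_incl and uw in forb_disj:
                derived.append((forb_disj, vw))            # Rule 14
            if vw in perm_disj and uw in forb_disj:
                derived.append((forb_incl, (u, v)))        # Rule 15
            for target, edge in derived:
                if edge not in target:
                    target.add(edge)
                    changed = True
    # An edge that is both permanent and forbidden means "no solution".
    consistent = not (perm_incl & forb_incl) and not (perm_disj & forb_disj)
    return {"perm_incl": perm_incl, "perm_disj": perm_disj,
            "forb_incl": forb_incl, "forb_disj": forb_disj,
            "consistent": consistent}
```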

Finally, we can get rid of characters exhibited by a single taxon: 

Rule 16. If a column in M contains at most one '1' entry, then remove this column. 

Given an instance of MFST, we apply the above data reduction rules until the
conditions of none of the rules are met. Whenever we change an entry of the matrix
M by the above rules, we can lower our parameter k by one which, in turn, may allow
us to apply other rules. Still, the complete data reduction requires only cubic time:

Theorem 1. Rules 1–16 are correct, and can be carried out to completion in O((m +
n)m²) time.

Proof. From the reasoning above, it is quite obvious that all rules are correct. So, we
focus on the running time of the data reduction.

Given an instance M of the MFST problem, we first compute the inclusion graph in
time O(m²n). Note that in the matrix M, at most O(mn) entries can be permanently
set to '0' or '1' during the course of the data reduction. Similarly, at most O(m²) edges
can be set to forbidden or permanent in the inclusion graph. Whenever we permanently
flip an entry in M, we lower our cost bound k by one.

Initially, we compute q(u − v) := |N(u − v)| and q(u + v) := |N(u + v)| for all u, v in
time O(m²n). Now, we can test Rules 1–2 in constant time for each pair u, v. Similarly,
we compute ico*(t, v) and icz*(t, v) for all v, t in time O(m²n), which allows us to test
Rules 6–7 in constant time for each pair t, v. During the course of our data reduction,
the parameter k will change, so we have to efficiently find those pairs u, v or t, v that
allow us to use one of these rules. For each value 0, . . . , k as well as all values > k we use an
individual bin, and we use doubly linked lists to access those pairs that allow application
of the above rules. Updating q(u − v), q(u + v), ico*(t, v), and icz*(t, v) can still be
performed in constant time. So, in constant time we can find a pair u, v or t, v to apply
a reduction rule, or decide that no such pair exists.

All other rules are only applied if a matrix entry is permanently set or flipped, or if
an edge in the inclusion graph is set to forbidden or permanent. For each rule, we now
analyze under what circumstances it can be applied, and what time is required to apply
the rule.

For Rules 1–2 we have to update q(u − v) and q(u + v) every time an entry in the
matrix is flipped. Assume that M[t, u] is the matrix entry being flipped, then we check
for all v ≠ u, whether q(u − v), q(v − u), or q(u + v) must be updated. In this case, these
values are increased or decreased by one, depending on the entry M[t, v]. In total, a flip
in the matrix M requires O(m) time to update all q(u − v) and q(u + v).

Rules 3–5 must be applied if either a matrix entry is permanently set, or if an edge
is set to forbidden or permanent. Regarding Rule 3, assume that M[t, u] is permanently
set to '1'. In this case, we have to test for all v ≠ u if (u, v) is permanent in the inclusion
graph, which can be done in time O(m). Now, assume that some edge (u, v) is set to
permanent. Then, we have to check for all taxa t if M[t, u] = 1 is permanent, which can be
done in time O(n). A similar reasoning applies for the other two rules.

Rules 6–7 require us to update ico*(t, v) and icz*(t, u) whenever either a matrix entry
is flipped, or an edge is set to permanent. Regarding ico*(·), assume that entry M[t, v] has
been flipped to '0'. Then, for all u ≠ v such that (u, v) is permanent, we increase ico*(t, u)
by one. Similarly, for all w ≠ v such that vw is permanent, we decrease ico*(t, w) by
one. If M[t, v] has been flipped to '1' then we do the same, exchanging increase and
decrease. This can be carried out in time O(m). If an edge is set to permanent, we can
update all affected entries in time O(n). A similar reasoning applies for the computation
of icz*(t, u).

For Rules 8–9 we have to update edges in case an entry M[t, v] is flipped: Then, we
have to consider all entries M[t, u] for u ≠ v, which can be done in time O(m).

Rules 10–15 update edges in case some edge between u and v is set to forbidden or
permanent: Then, we have to consider all vertices w ≠ u, v, which requires O(m) time.

Applying the above rules may result in more than one "update operation" to be
carried out. We can keep all such update operations on a stack, and carry out
the next update operation only after we have finished the current one.

We conclude that permanently setting an entry of the matrix requires O(m) time
for checking all of the rules. Since we can permanently set at most O(mn) entries, this
requires O(m²n) time in total. Similarly, setting an edge to permanent requires O(m + n)
time for checking our rules. Since there are O(m²) edges, this requires O((m + n)m²)
time in total. This results in a running time of O((m + n)m²) for the full data reduction.

□ 

If we reach a conflict in our data reduction, such as permanently setting some M[t, v]
to 0 and 1 at the same time, then we infer that there exists no solution of the instance
of cost at most k.

5 Upper and lower bounds 

We will now describe a lower bound for the Minimum Flip Supertree problem, which
we will use to derive improved versions of Rules 1–2 and 6–7. A local conflict consists of
two characters u, v ∈ {1, . . . , m} and three taxa t1, t2, t3 ∈ {1, . . . , n} such that M[t1, u] =
M[t2, u] = M[t2, v] = M[t3, v] = 1 but M[t3, u] = M[t1, v] = 0. In the Minimum Flip
Consensus Tree setting, M admits a perfect phylogeny if and only if M does not
contain a local conflict [S]. For MFST, we can only reason that if M contains a local
conflict, then it does not admit a perfect phylogeny.

We now use local conflicts to compute a lower bound for the cost of an instance M:
We say that two local conflicts are edge-disjoint if the local conflicts do not contain a
common tuple (v, t). The term "edge-disjoint" stems from visualizing the matrix M as
a bipartite graph [9], as noted in Sec. 2. Let C be a set of edge-disjoint local conflicts in
M. Now, for every element in C we have to make at least one modification to the matrix
M to remove the local conflict, so |C| is a lower bound to the cost of an optimal solution.
Unfortunately, it is not obvious how to efficiently find a set C of maximal cardinality:
For example, the obvious transformation to a graph leads to the NP-hard Maximum
Independent Set problem. In case columns of M have been weighted, we can follow
a greedy strategy, choosing a local conflict that maximizes the cost of the current step.
We can also weight each local conflict by the (inverse) number of other local conflicts
it has at least one common edge with. Finally, we can restart the algorithm several
times, choosing a random local conflict in each step, and maximize over these bounds.
In theory, we can compute another lower bound by solving the relaxation of the Integer
Linear Program presented in [10], but this is usually too slow in practice.
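
A local conflict between two given characters can be located with a single pass over the taxa. The sketch below assumes entries 0, 1, or the string '?' and returns the first witnesses it finds; the function name and the return format are hypothetical.

```python
def find_local_conflict(matrix, u, v):
    """Return taxa (t1, t2, t3) forming a local conflict of u and v, or None.

    t1 has pattern (1, 0), t2 has pattern (1, 1), and t3 has pattern (0, 1)
    in columns (u, v). Rows with a '?' in either column never take part.
    """
    t1 = t2 = t3 = None
    for t, row in enumerate(matrix):
        a, b = row[u], row[v]
        if a == 1 and b == 0 and t1 is None:
            t1 = t
        elif a == 1 and b == 1 and t2 is None:
            t2 = t
        elif a == 0 and b == 1 and t3 is None:
            t3 = t
    if t1 is None or t2 is None or t3 is None:
        return None
    return (t1, t2, t3)
```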

Testing all characters u, v and taxa t1, t2, t3 is prohibitive in applications, as this
requires O(m²n³) time. But we can use sets N(u − v) and N(u + v) for this purpose: Two
characters are in conflict if all sets N(u − v), N(u + v), and N(v − u) contain at least one
element. We now describe an improved algorithm for the greedy strategy and its variants
discussed above. Initially, we compute the cardinalities q(u − v) and q(u + v) in O(m²n)
time; storing q requires O(m²) space. We start with a set L = {{u, v} : 1 ≤ u < v ≤ m}
of character pairs that are potentially in conflict. We then select a pair {u, v} from L,
either randomly or by some other criterion. If u, v are no longer in conflict, we remove
{u, v} from L. Otherwise, we choose a certain local conflict u, v, t1, t2, t3 to be part of
our set C: We then update, for each tuple (w, t) of the local conflict, w ∈ {u, v} and
t ∈ {t1, t2, t3}, all cardinalities q(w − w'), q(w + w'), and q(w' − w) for all characters
w' in time O(mn). Let k_opt be the cost of an optimal solution, then |C| ≤ k_opt ≤ mn.
All updates require O(|C| mn) time and, hence, O(k_opt mn) time. The whole procedure
requires only O(k_opt mn + m²n) time in total. In practice, we can speed up calculations
by initializing L with those pairs {u, v} that are initially in conflict. Also, note that our
data reduction requires us to maintain cardinalities q before we enter the computation
of a lower bound.
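
For illustration, a greedy lower bound from edge-disjoint local conflicts can be sketched without the cardinalities q. This naive variant scans the rows of each column pair repeatedly and is therefore slower than the procedure described above; it processes pairs in a fixed order rather than randomly, and all names are our assumptions.

```python
from itertools import combinations

def greedy_lower_bound(matrix):
    """Count greedily chosen, pairwise edge-disjoint local conflicts."""
    n = len(matrix)
    m = len(matrix[0]) if matrix else 0
    used = set()   # entries (t, c) already part of a chosen local conflict
    bound = 0
    for u, v in combinations(range(m), 2):
        while True:
            pick = {}   # pattern in columns (u, v) -> witness taxon
            for t in range(n):
                if (t, u) in used or (t, v) in used:
                    continue
                key = (matrix[t][u], matrix[t][v])
                if key in ((1, 0), (1, 1), (0, 1)) and key not in pick:
                    pick[key] = t
            if len(pick) < 3:
                break   # no further unused local conflict for this pair
            for t in pick.values():
                used.update({(t, u), (t, v)})
            bound += 1
    return bound
```

Since no two chosen conflicts share an entry, each of them needs its own flip, so the returned count never exceeds the optimal cost.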

We now use a trick introduced in [4] to lift a local reduction rule to a global version:
Rules 1–2 are local, in the sense that these rules only take into account entries M[t, u]
and M[t, v] for all taxa t. Similarly, computing ico*(t, v) and icz*(t, v) will consider
entries M[t, w] for all characters w, but ignore the rest of the matrix. As all other rows
or columns of M have to be cleaned of local conflicts at a later stage, it makes sense to
estimate the cost for doing so using a lower bound.

Let lb_M(t) be any lower bound where, during the calculation of this bound, the row
of M corresponding to taxon t is not taken into account. Similarly, we write lb_M(u, v)
for two ignored character columns u, v. Now, we can write improved versions of Rules
1–2 and 6–7:

Rule 17. If |N(u − v)| + lb_M(u, v) > k then set (u, v) to forbidden.
Rule 18. If |N(u + v)| + lb_M(u, v) > k then set uv to forbidden.
Rule 19. If ico*(t, v) + lb_M(t) > k then permanently set M[t, v] = 0.
Rule 20. If icz*(t, u) + lb_M(t) > k then permanently set M[t, u] = 1.

The correctness of these rules follows immediately. Unfortunately, we have to compute
an individual lower bound lb_M(u, v) for every pair u, v and lb_M(t) for every t. To further
speed up calculations, we can initially compute a lower bound lb_M of the complete
instance, and calculate lb_M(u, v) only for those pairs where |N(u − v)| + lb_M > k or
|N(u + v)| + lb_M > k holds. Note that there may exist rare cases where our lower bound
computations are not monotone, so that lb_M(u, v) > lb_M, and we will miss a rule that
could have been applied. We expect this to be negligible in practice. A similar reasoning
applies for lb_M(t).
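
The two-stage test for Rule 17 can be sketched as follows. The expensive pair-specific bound is passed as a callable so that it is only evaluated when the cheap global filter does not already rule the pair out; the function and parameter names are hypothetical.

```python
def rule_17_fires(n_minus_size, k, lb_global, lb_without_pair):
    """Decide whether Rule 17 sets (u, v) to forbidden.

    n_minus_size:    |N(u - v)|
    lb_global:       lower bound of the complete instance
    lb_without_pair: zero-argument callable returning the lower bound
                     computed while ignoring columns u and v
    """
    if n_minus_size + lb_global <= k:
        return False   # cheap filter: skip the expensive bound
    return n_minus_size + lb_without_pair() > k
```

As noted above, the filter can miss an application of the rule in the rare case that the pair-specific bound exceeds the global one.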

The above rules still depend on parameter k. To reach a parameter-independent
data reduction, we have to choose an appropriate k: To this end, note that the cost
of any heuristic solution to an instance is always an upper bound to the cost of an
optimal solution. So, we can choose any heuristic to compute an appropriate k, and then
apply our parameter-dependent data reduction, using a lower bound for Rules 17–20.
It must be understood that for practically all real-world instances, the cost k computed
by any heuristic will be too large to directly apply Rules 1–2 and 6–7. Only through
our algorithm engineering technique of using lower bounds can we successfully start our
data reduction. Rules 19 and 20 will sometimes allow us to lower the cost k and, hence,
the complexity of the remaining instance.

Chen et al. [8] have introduced an involved heuristic for the problem that, much like 
heuristics for the Maximum Parsimony problem, is based on exploring tree space via 
branch swapping. This heuristic is rather time-consuming and can require minutes or 
even hours of running time, but its results are of excellent quality [10]. Another upper
bound can be computed by running the ILP from [10] for some time, and stopping after a
fixed time before upper and lower bound of the instance coincide.

6 Experiments 

We implemented all evaluated algorithms in Java. Computations were performed on an 
AMD Opteron-275 2.2 GHz with 6 GB of memory running Solaris 10. 

We now evaluate the parameter-independent data reduction. As indicated in the 
introduction, we can use our reduction as a preprocessing step, and solve the reduced 
instance with any exact, approximation, or heuristic algorithm. To evaluate the 
performance of our data reduction, we use different measures: 

— We calculate the ratio of fixed entries, '0/1' entries that are not flipped by the data 
reduction but set to permanent, relative to the number of '0/1' entries in the input 
matrix. 

— Similarly, we calculate the ratio of flipped entries, which are always permanent. 

— We calculate the ratio of resolved entries, '?' entries that are permanently set to 
'0/1', relative to the number of '?' entries in the input matrix. 

[Table 1 body, partially lost in extraction. Surviving values: 0.51, 12.26, 9.88, 8.78, 1.11. Running time (h:min): 0:50, 5:40, 22:51, 24:08.]

Table 1. Results of the data reduction for 25% taxa deletion. Averages over 100 
instances. All numbers except "number flips" and "running time" in percent. 



— We calculate the ratio of permanent entries in the matrix, relative to the size mn of 
the matrix. 

— Next, we count how many edges in the inclusion graph have been set to permanent, 
and compare this to the $\binom{m}{2}$ possible edges. 

— For all pairs u, v where no permanent edge exists, we count the number of forbidden 
edges, and compare this to the $3\binom{m}{2}$ possible forbidden edges. 

— Finally, we calculate the number of flips executed by the data reduction, and compare 
it to the number of flips required to solve the instance. Every flip executed during 
data reduction lowers the cost and, hence, the complexity of the resulting instance. 
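
The matrix-based measures from this list can be computed as in the following sketch. The data layout is an assumption made for illustration: matrices are given as lists of rows over '0', '1', '?', and `permanent` is a hypothetical Boolean mask of the same shape marking entries set to permanent. The measures on the inclusion graph are omitted.

```python
def reduction_ratios(original, reduced, permanent):
    """Compute the fixed, flipped, resolved, and permanent ratios of one instance.

    original, reduced -- lists of rows (strings or lists) over '0', '1', '?'
    permanent         -- same shape; True where the entry was set to permanent
    """
    n01 = nq = fixed = flipped = resolved = perm = 0
    for row_o, row_r, row_p in zip(original, reduced, permanent):
        for o, r, p in zip(row_o, row_r, row_p):
            if p:
                perm += 1
            if o == "?":
                nq += 1
                if p and r != "?":
                    resolved += 1      # '?' permanently set to 0 or 1
            else:
                n01 += 1
                if r != o:
                    flipped += 1       # flipped entries are always permanent
                elif p:
                    fixed += 1         # kept as is, but now permanent

    def ratio(a, b):
        return a / b if b else 0.0

    return {
        "fixed": ratio(fixed, n01),
        "flipped": ratio(flipped, n01),
        "resolved": ratio(resolved, nq),
        "permanent": ratio(perm, n01 + nq),
    }
```

Fixed and flipped entries are both measured against the number of '0/1' entries of the input matrix, resolved entries against the number of '?' entries, and permanent entries against the full matrix size mn.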

For our evaluation, we use instances generated by Eulenstein et al. [12]; see there 
for details. These simulated datasets are very similar to a regular phylogenetic supertree 
study, yet for each dataset we know the true model tree behind the data. Unfortunately, 
running times of our data reduction are currently prohibitive for larger instances as 
well as for instances with a large fraction of '?'. We therefore concentrate on matrices 
containing 25% '?'-entries, generated from s = 4, 6, 8 input trees. The number of taxa 
n is either 48 or 96. These matrices contain $m \approx (pn - 2)s$ columns, where $1 - p$ 
is the ratio of '?'-entries. For each parameter combination, we use the 100 instances 
named "random", for which deleted taxa were randomly chosen. We use the heuristic 
solutions from [8] as upper bounds for our parameter-independent data reduction, and 
the randomized lower bound introduced earlier with 100 repetitions. 
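
To give a feeling for the instance sizes, the column estimate above can be evaluated for the parameter combinations used here. The helper name `approx_columns` is hypothetical; the formula and the value p = 0.75 (25% '?'-entries) are taken from the text.

```python
def approx_columns(n, s, p=0.75):
    """Approximate number of columns m = (p*n - 2) * s of the input matrix."""
    return round((p * n - 2) * s)


for n in (48, 96):
    for s in (4, 6, 8):
        print(f"n={n}, s={s}: m is about {approx_columns(n, s)}")
```

For n = 48 this gives about 136, 204, and 272 columns, and for n = 96 about 280, 420, and 560 columns.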

Table 1 shows that reduction ratios deteriorate both with an increasing number of input 
trees and with an increasing number of taxa. We expect this effect to be even stronger 
for higher ratios of taxa deletion. Currently, the limiting factor is the high running time 
of the data reduction. On the other hand, we observe that the data reduction truly does 
reduce the instances. This is a clear indication that with an improved implementation, 
algorithm engineering, new data reduction rules, and an improved lower bound, we may 
indeed simplify MFST instances in polynomial time. 

7 Conclusion 

We have presented a set of data reduction rules that allow us to preprocess instances 
of the Minimum Flip Supertree problem, and also of the "simpler" Minimum 
Flip Consensus Tree problem. Our data reduction can be applied in polynomial 
running time. Different from [14], our reduction allows us to draw conclusions about 
certain entries in the input matrix. This is highly desirable, as flipping entries during 
preprocessing means that we are reducing the cost of the resulting instance: Chimani et 
al. [10] found that ILP running times are strongly correlated with the optimal number 
of flips. 

Our method allows us to draw conclusions about MFST instances, guaranteeing 
both polynomial running time and optimality of the solution. On the practical side, the 
output of our method can be subsequently processed with any method, including fast 
heuristics. Unfortunately, our reduction is currently not suited for real-world application, 
as running times are prohibitive and reduction results are minor. Still, we think that 
this is an important first step towards a data reduction that is applicable in practice. 
An improved data reduction may be ultimately combined with heuristics to obtain a 
supertree method that is both fast and accurate in practice. 

Note that the inclusion graph can also be used as an algorithm engineering technique 
for the MFCT search tree algorithm from [5]. We conjecture that these techniques will 
make the search tree algorithm much faster in practice. Unfortunately, it appears this 
cannot be used to improve upon the worst-case running time of the algorithm. 

We conjecture that our data reduction from Sec. 0] can be used as part of a problem 
kernel for the MFCT problem. From the theoretical side, it is an interesting open 
question whether this allows us to find a better-than-cubic kernel [2]. Finding a kernel for the 
MFST problem, on the other hand, is related to the open question whether MFST is 
fixed-parameter tractable. 

Acknowledgment. Implementation by Konstantin Riege and Andreas Dix. 